Finding "Patterns of Life" in NYC Taxi Traffic

In this notebook, we use tensor decompositions in order to find coherent "patterns of life" in NYC taxi trip records. We show that ENSIGN, when applied to spatiotemporal, entity-based data, can extract distinct travel, work, and leisure activities.

In [1]:
# data manipulation 
import numpy as np 
import pandas as pd 

# ENSIGN tools
from ensign.csv2tensor import csv2tensor 
from ensign.cp_decomp import cp_apr, read_cp_decomp_dir, write_cp_decomp_dir
from ensign.visualize import plot_component 

# custom plotting
import plotly.express as px 

# Needed to display visuals in Jupyter Notebook
%matplotlib inline 

Data

We explore data on taxi rides provided by the NYC Taxi & Limousine Commission (TLC). The data includes features such as pickup and dropoff times and locations, the distance of the trip, the number of passengers, the payment amount and method, and more. As we are specifically interested in "patterns of life", that is, the types of trips and the reasons for them, we focus on the trip times and locations. We consider just one week of data, June 13-19, 2016, but several years of data can be found on the TLC site.

In [2]:
pd.read_csv('data/taxi_data.csv')
Out[2]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2016-06-13 00:02:54 2016-06-13 00:23:07 1 10.30 -73.853882 40.759350 2 N -73.970634 40.793297 2 52.0 0.0 0.5 0.00 5.54 0.3 58.34
1 2 2016-06-13 00:02:54 2016-06-13 00:20:45 2 9.47 -73.874580 40.773991 1 N -73.997139 40.736641 1 27.5 0.5 0.5 6.87 5.54 0.3 41.21
2 1 2016-06-13 00:02:55 2016-06-13 00:07:19 1 1.30 -73.956161 40.771927 1 N -73.967941 40.755821 1 6.0 0.5 0.5 1.45 0.00 0.3 8.75
3 1 2016-06-13 00:02:55 2016-06-13 00:08:56 1 1.00 -73.984879 40.748096 1 N -73.991730 40.754707 2 6.0 0.5 0.5 0.00 0.00 0.3 7.30
4 2 2016-06-13 00:02:55 2016-06-13 00:03:14 1 0.04 -73.950432 40.826599 1 N -73.950233 40.826557 2 2.5 0.5 0.5 0.00 0.00 0.3 3.80
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2572351 2 2016-06-19 23:45:21 2016-06-19 23:52:13 1 1.40 -73.968925 40.760899 1 N -73.991943 40.770603 1 7.5 0.5 0.5 1.00 0.00 0.3 9.80
2572352 1 2016-06-19 23:46:20 2016-06-19 23:50:48 1 0.60 0.000000 0.000000 1 Y 0.000000 0.000000 2 5.0 0.5 0.5 0.00 0.00 0.3 6.30
2572353 2 2016-06-19 23:46:39 2016-06-19 23:48:56 2 0.62 -74.002419 40.750240 1 N -73.994347 40.752590 2 4.0 0.5 0.5 0.00 0.00 0.3 5.30
2572354 2 2016-06-19 23:47:13 2016-06-19 23:59:23 5 4.30 -74.004814 40.725609 1 N -73.950722 40.723656 1 15.0 0.5 0.5 3.26 0.00 0.3 19.56
2572355 2 2016-06-19 23:49:58 2016-06-19 23:56:11 1 1.47 -73.984680 40.748310 1 N -73.979019 40.762081 2 7.0 0.5 0.5 0.00 0.00 0.3 8.30

2572356 rows × 19 columns

Tensor Construction

When constructing a tensor for decomposition, the main considerations are which features to select and how to discretize them. This discretization process, known as binning, ensures that similar values in the chosen dimensions share the same tensor index, which results in more coherent patterns. The relevant data here are the time of the trip and the starting and ending locations, so we select five columns: pickup time, pickup latitude, pickup longitude, dropoff latitude, and dropoff longitude. We round the times to the nearest hour and round the coordinates to three decimal places. Moreover, we fuse the latitude and longitude so that the indices in those modes represent specific locations. The spatial binning, together with the fusing operation, results in mode indices corresponding roughly to city blocks.

In [3]:
tensor = csv2tensor(
    filepaths='data/taxi_data.csv', 
    columns=[ 
        'tpep_pickup_datetime', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude'
    ],
    types=['datetime', 'float64', 'float64', 'float64', 'float64'], 
    binning=['hour', 'round=3', 'round=3', 'round=3', 'round=3'],
    sort=['tpep_pickup_datetime'],
    fuse_columns=[['pickup_longitude', 'pickup_latitude'], ['dropoff_longitude', 'dropoff_latitude']]
)
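To make the binning and fusing steps concrete, here is a minimal pandas-only sketch of the same idea on three made-up trips. It is not ENSIGN's implementation: the helper name `bin_and_fuse` is hypothetical, and the `__` separator is assumed from the label format that the plotting code later in this notebook splits on. The sketch rounds to the hour to match the description above; whether ENSIGN's `hour` binning rounds or truncates is not shown here.

```python
import pandas as pd

def bin_and_fuse(trips: pd.DataFrame) -> pd.DataFrame:
    """Count trips per (hour, pickup cell, dropoff cell). Hypothetical helper."""
    def fuse(lon: pd.Series, lat: pd.Series) -> pd.Series:
        # Round to three decimals, then join into one 'lon__lat' label
        # (the '__' separator is an assumption).
        return lon.round(3).astype(str) + "__" + lat.round(3).astype(str)

    binned = pd.DataFrame({
        # Round pickup times to the nearest hour.
        "hour": pd.to_datetime(trips["tpep_pickup_datetime"]).dt.round("60min"),
        "pickup": fuse(trips["pickup_longitude"], trips["pickup_latitude"]),
        "dropoff": fuse(trips["dropoff_longitude"], trips["dropoff_latitude"]),
    })
    # Each distinct (hour, pickup, dropoff) triple becomes one nonzero
    # tensor entry whose value is the number of trips in that cell.
    return binned.groupby(["hour", "pickup", "dropoff"]).size().reset_index(name="count")

# Three made-up trips: the first two fall into the same cell.
sample = pd.DataFrame({
    "tpep_pickup_datetime": ["2016-06-13 00:02:54", "2016-06-13 00:10:00", "2016-06-13 08:40:00"],
    "pickup_longitude": [-73.98488, -73.98492, -73.95616],
    "pickup_latitude": [40.74810, 40.74806, 40.77193],
    "dropoff_longitude": [-73.99173, -73.99168, -73.96794],
    "dropoff_latitude": [40.75471, 40.75466, 40.75582],
})
cells = bin_and_fuse(sample)
print(cells)
```

The two nearby midnight trips collapse into a single entry with count 2, which is the effect binning is meant to have: trips that differ only by a few meters or minutes reinforce the same tensor index.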

Tensor Decomposition

We use CP-APR to decompose the tensor because we constructed the tensor to have count entries. A rank-100 decomposition extracted coherent patterns, and raising the rank did not yield further components of interest.

In [4]:
decomp = cp_apr(tensor, 100, mem_limit_gb=16)
write_cp_decomp_dir('taxi_decomposition', decomp, write_tensor=True)
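As background on why count data motivates CP-APR, the following NumPy sketch builds a small dense 3-way tensor from CP weights and factor matrices and evaluates the Poisson negative log-likelihood (up to a constant), which is the kind of objective CP-APR minimizes. The function names `cp_model` and `poisson_loss` are ours, the sizes are toy values, and this is an illustration of the model only, not of ENSIGN's solver.

```python
import numpy as np

def cp_model(weights, factors):
    """Dense 3-way tensor from CP weights and factor matrices (one column per component)."""
    a, b, c = factors
    # Sum over components r of weights[r] * a[:, r] (outer) b[:, r] (outer) c[:, r].
    return np.einsum("r,ir,jr,kr->ijk", weights, a, b, c)

def poisson_loss(x, m, eps=1e-12):
    """Poisson negative log-likelihood of counts x under model m, up to a constant."""
    return float(np.sum(m - x * np.log(m + eps)))

rng = np.random.default_rng(0)
factors = [rng.random((4, 2)), rng.random((5, 2)), rng.random((3, 2))]  # toy sizes, rank 2
weights = np.array([3.0, 1.0])
model = cp_model(weights, factors)

# Simulated count data drawn from the model, as Poisson samples.
counts = rng.poisson(model)
print(poisson_loss(counts, model))
```

A least-squares objective implicitly assumes Gaussian noise, which is a poor match for sparse, nonnegative counts; the Poisson objective is the usual reason to prefer CP-APR for tensors like this one.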

Evaluating the decomposition quality: The CPDecomp object exposes a dictionary field, metrics, that records the running time, several quality metrics, and the number of completed iterations. A perfect fit of 1 is not necessary for high-quality, interpretable components. Here, a fit on the order of 10^-1, coupled with a high cosine similarity, indicates good decomposition results. Therefore, we can be reasonably confident that the decomposition components capture the main activity in the original data.

In [5]:
decomp.metrics
Out[5]:
{'time': 441.4710237979889,
 'fit': 0.41096085146631833,
 'cosine_sim': 0.8081654036306326,
 'norm_scaling': 0.7982438044274881,
 'coverage': 0.998795747756958,
 'cp_total_iter': 100}
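To see how a moderate fit and a high cosine similarity can coexist, the sketch below computes both quantities for a toy tensor using their common textbook definitions. These definitions are an assumption on our part: the notebook does not state how ENSIGN computes `fit` or `cosine_sim`, and its formulas may differ.

```python
import numpy as np

def fit_and_cosine(x, m):
    """Relative fit and cosine similarity between a tensor x and a model m.

    Assumed definitions (not taken from ENSIGN):
      fit    = 1 - ||x - m|| / ||x||
      cosine = <x, m> / (||x|| * ||m||)
    """
    x_norm = np.linalg.norm(x)  # with no axis/ord, this is the norm of the flattened array
    m_norm = np.linalg.norm(m)
    fit = 1.0 - np.linalg.norm(x - m) / x_norm
    cosine = float(np.sum(x * m) / (x_norm * m_norm))
    return float(fit), cosine

rng = np.random.default_rng(1)
x = rng.random((4, 5, 3))

# A model with the right "shape" but half the magnitude.
fit, cosine = fit_and_cosine(x, 0.5 * x)
print(fit, cosine)
```

Under these definitions, cosine similarity ignores overall scale while fit penalizes it, so a model can point in nearly the same direction as the data and still have a fit well below 1.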

Component Visualization and Interpretation

We can visualize each component by plotting the scores in each mode vector involved in the outer product reconstructing that component. The labels along each mode correspond to the binned values created during tensor construction. Any tuple of scoring indices in the outer product is a tensor index involved in the pattern described by the component. Therefore, the labels of the scoring indices describe the pattern. Specifically, the hour mode indicates when the trips occur, and the pickup and dropoff modes indicate where the described trips started and ended. Reading these plots of components in this manner allows us to describe coherent trends in the data.
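The outer-product reading described above can be checked on a toy component. In this sketch the three mode vectors and their sizes are made up; the point is only that the nonzero entries of the rank-one tensor are exactly the combinations of indices that score in every mode.

```python
import numpy as np

# Made-up mode vectors for a single component.
hour = np.array([0.0, 0.7, 0.3])          # hours 1 and 2 score
pickup = np.array([0.9, 0.0, 0.1, 0.0])   # pickup locations 0 and 2 score
dropoff = np.array([0.0, 1.0])            # dropoff location 1 scores

# Rank-one tensor: component[i, j, k] = hour[i] * pickup[j] * dropoff[k].
component = np.einsum("i,j,k->ijk", hour, pickup, dropoff)

# Tensor indices that take part in the pattern.
active = np.argwhere(component > 0)
print(active)
```

Here 2 scoring hours, 2 scoring pickups, and 1 scoring dropoff give 2 x 2 x 1 = 4 active entries, and the largest entry pairs the highest-scoring index from each mode.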

As latitude and longitude location data is not easily human readable, for selected components, we plot the high-scoring pickup and dropoff locations. The pickup and dropoff locations are color-coded and scaled by their score in the component.

In [6]:
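# Custom plotting function for geographic visualization of this taxi decomposition
def taxi_plot(decomp, comp_id, zoom_level=12):
    """Map the nonzero pickup (mode 1) and dropoff (mode 2) scores of one component.

    Location labels are fused 'lon__lat' strings; marker size is the score.
    """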
    df = pd.DataFrame(
        np.array([
            np.concatenate([decomp.factors[1][:, comp_id], decomp.factors[2][:, comp_id]]),
            np.concatenate([
                len(decomp.factors[1][:, comp_id])*['Pickup Location'], 
                len(decomp.factors[2][:, comp_id])*['Dropoff Location']
            ]),
            np.concatenate([
                [l.split('__') for l in decomp.labels[1]], [l.split('__') for l in decomp.labels[2]]
            ])[:, 0],
            np.concatenate([
                [l.split('__') for l in decomp.labels[1]], [l.split('__') for l in decomp.labels[2]]
            ])[:, 1]
        ]).T,
        columns=['score', 'type', 'lon', 'lat']
    )
    df['score'] = df['score'].astype(float)
    df['lat'] = df['lat'].astype(float)
    df['lon'] = df['lon'].astype(float)
    df = df[df['score'] != 0.0]

    fig = px.scatter_mapbox(df, lat="lat", lon="lon", zoom=zoom_level, height=800, size="score", color="type")
    fig.update_layout(mapbox_style="open-street-map")
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    fig.show()
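When a map is more than is needed, the same information can be read off as a short ranked list. The sketch below is a hypothetical helper, `top_locations`, that assumes a mode vector is a 1-D array of scores (as in `decomp.factors[1][:, comp_id]`) and that labels are `lon__lat` strings like those split in `taxi_plot`. The demo scores and coordinates are made up.

```python
import numpy as np

def top_locations(scores, labels, k=5):
    """Return up to k highest-scoring (lon, lat, score) triples for one mode vector."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    top = []
    for i in order:
        if scores[i] == 0.0:
            continue  # zero-score locations do not take part in the component
        lon, lat = labels[i].split("__")
        top.append((float(lon), float(lat), float(scores[i])))
    return top

# Made-up scores and fused labels for illustration only.
demo_scores = [0.0, 0.62, 0.05, 0.33]
demo_labels = ["-73.985__40.748", "-74.006__40.74", "-73.999__40.731", "-73.988__40.72"]
print(top_locations(demo_scores, demo_labels, k=2))
```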

Nightlife

Component 27 clusters together the taxi rides associated with nightlife. By plotting the scores in the mode vectors of the component, we see that the scores in the time mode are concentrated in the Friday and Saturday night hours. Additionally, the high-scoring pickup and dropoff locations are the Meatpacking District, Greenwich Village, and the Lower East Side, all of which are known for their bars and restaurants. The pickup and dropoff locations are plotted on the map below.

In [7]:
plot_component(decomp, 27)
Out[7]:

The datetime pattern is especially isolated here, peaking at 1-2am on Saturday and Sunday mornings. Also, we can see a few people starting the weekend early on Thursday night!
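One way to read a time-mode vector like this one is to fold the week into a day-of-week by hour-of-day table. The sketch below does that with pandas on synthetic scores. The helper name `weekly_profile` is hypothetical, and it assumes the hour-mode labels can be parsed by `pd.to_datetime`; the actual label format produced by ENSIGN is not shown in this notebook.

```python
import numpy as np
import pandas as pd

def weekly_profile(hour_labels, scores):
    """Pivot hourly scores into a (day of week) x (hour of day) table."""
    stamps = pd.to_datetime(pd.Series(hour_labels))
    table = pd.DataFrame({
        "day": stamps.dt.day_name(),
        "hour": stamps.dt.hour,
        "score": np.asarray(scores, dtype=float),
    })
    return table.pivot_table(index="day", columns="hour", values="score",
                             aggfunc="sum", fill_value=0.0)

# Synthetic week of hourly bins starting Monday, June 13, 2016,
# with two made-up late-night peaks.
hours = pd.date_range("2016-06-13", periods=7 * 24, freq="60min")
scores = np.zeros(len(hours))
scores[hours == pd.Timestamp("2016-06-18 01:00")] = 5.0  # Saturday, 1am
scores[hours == pd.Timestamp("2016-06-19 02:00")] = 3.0  # Sunday, 2am

profile = weekly_profile(hours, scores)
print(profile.loc["Saturday"].idxmax())
```

With a single week of data each (day, hour) cell holds one score; over several weeks, the sum would aggregate repeated occurrences of the same weekly slot.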

The other two modes are better interpreted with the geographic visualization below.

In [8]:
taxi_plot(decomp, 27)

Penn Station Dropoffs

People frequently take cabs to Penn Station to catch trains leaving the city, so it is natural that there is a component capturing this behavior. The dominant dropoff location in Component 5 is Penn Station, and the pickup locations are in the surrounding area. These are plotted on the map below. The time mode in the component plots below shows that these trips happen daily. Notably, there is a spike at 6pm Monday through Friday. This uptick in Penn Station dropoffs corresponds to the many commuters leaving the city for the day.

In [9]:
plot_component(decomp, 5)
Out[9]:
In [10]:
taxi_plot(decomp, 5)

Rides to an Airport

Component 22 consists mainly of dropoffs at JFK, LaGuardia, and Newark airports, while the pickups are mainly in Manhattan. These high-scoring locations are plotted on the map below. As expected, the time mode in the component plots below shows that these trips occur daily. Note that each day's scores have a bimodal shape: there are morning and afternoon peaks corresponding to when the majority of flights leave.

In [11]:
plot_component(decomp, 22)
Out[11]:
In [12]:
taxi_plot(decomp, 22, 10)

Key Takeaways

In this notebook, we used tensor decompositions to find distinct "patterns of life" in entity-based spatiotemporal data. We created a tensor encoding the time, pickup location, and dropoff location of taxi trips in NYC during one week, and we decomposed it to obtain encapsulations of behaviors in the original data. Each component, ordered by weight, represented one such behavior; these ranged from rush-hour trips to Penn Station to taxi rides after a night out. Intuitively, by finding a compressed representation of the original tensor, the decomposition clustered similar trips together. Notably, we were able to isolate these coherent patterns from only three features with minimal feature engineering (we only discretized the range of possible values). There is nothing special about the taxi data: tensor decompositions can summarize the activities found in any entity-based data. An interesting dataset for further exploration is Vessel Traffic Data, which all passenger vessels are required by law to produce. It includes, among other features, ship identification, time, and location. After decomposing a tensor encoding these features, one would expect to find distinct components representing the behaviors of different types of ships, such as commercial ships in the Port of New York and New Jersey or recreational vessels in Long Island Sound.